{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "p5b0PzpW1xJq"
      },
      "source": [
        "![nebullvm nebuly AI accelerate inference optimize DeepLearning](https://user-images.githubusercontent.com/38586138/201391643-a80407e5-2c28-409c-90c9-327795cd27e8.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Accelerate ONNX ResNet50 with Speedster"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "T9xuwZEHzN2K"
      },
      "source": [
        "Hi and welcome 👋\n",
        "\n",
        "In this notebook we will discover how in just a few steps you can speed up the response time of deep learning model inference using the Speedster app from the open-source library `nebullvm`.\n",
        "\n",
        "We will\n",
        "1. Install Speedster and the deep learning compilers used by the library.\n",
        "2. Speed up an ONNX ResNet50 without any loss of accuracy.\n",
        "3. Achieve faster acceleration on the same model by applying more aggressive optimization techniques (e.g. pruning, quantization) under the constraint of sacrificing up to 2% accuracy.\n",
        "\n",
        "Let's jump to the code."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5Yc5KYo_YzE8"
      },
      "outputs": [],
      "source": [
        "%env CUDA_VISIBLE_DEVICES=0"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HbFy2Aykz2Qo"
      },
      "source": [
        "# Installation"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "48aljCHu14-H",
      "metadata": {
        "id": "48aljCHu14-H"
      },
      "source": [
        "Install Speedster:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "QFQh3BVr1-GO",
      "metadata": {
        "id": "QFQh3BVr1-GO"
      },
      "outputs": [],
      "source": [
        "!pip install speedster"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8a7a86b3",
      "metadata": {
        "id": "8a7a86b3"
      },
      "source": [
        "Install deep learning compilers:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cffbfa32",
      "metadata": {
        "id": "cffbfa32"
      },
      "outputs": [],
      "source": [
        "!python -m nebullvm.installers.auto_installer --frameworks onnx --compilers all"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "N5RXHoZl0p3p"
      },
      "source": [
        "# Optimization example with ONNX"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-Ju-VcRH01Mw"
      },
      "source": [
        "In the following example we will try to optimize a standard ONNX resnet50.\n",
        "\n",
        "Speedster can accelerate neural networks without loss of a user-defined precision metric, e.g. accuracy, or can achieve faster acceleration by applying more aggressive optimization techniques, such as pruning and quantization, that may have a negative impact on the selectic metric. The maximum threshold value for accuracy loss is determined by the metric_drop_ths parameter. Read more in the [docs](https://nebuly.gitbook.io/nebuly/nebullvm/get-started).\n",
        "\n",
        "Let first test the optimization without accuracy loss (metric_drop_ths=0, default value), and then apply further accelerate it under the constrained of losing up to 2% of accuracy (metric = \"accuracy\", metric_drop_ths = 0.02)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "skxEuemn171G"
      },
      "source": [
        "## Scenario 1 - No accuracy drop"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wVRLXrDi2VaG"
      },
      "source": [
        "First of all we download the pretrained ONNX resnet50 model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6I5GDvWbZ-LJ",
        "outputId": "6ac09b39-9c6e-4d38-dfb6-35069938f9c1"
      },
      "outputs": [],
      "source": [
        "!wget https://github.com/onnx/models/raw/main/vision/classification/resnet/model/resnet50-v1-12.onnx"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vrkOvGfkaXk7"
      },
      "source": [
        "Then we optimize it with Speedster simple API"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2RbgGruAeQcf"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "from speedster import optimize_model, save_model, load_model\n",
        "\n",
        "# Load a resnet as example\n",
        "model = \"resnet50-v1-12.onnx\"\n",
        "\n",
        "# Provide an input data for the model    \n",
        "input_data = [((np.random.randn(1, 3, 224, 224).astype(np.float32), ), np.array([0]))]\n",
        "\n",
        "# Run Speedster optimization\n",
        "optimized_model = optimize_model(\n",
        "  model, input_data=input_data, optimization_time=\"unconstrained\"\n",
        ")\n",
        "\n",
        "# Try the optimized model\n",
        "x = np.random.randn(1, 3, 224, 224).astype(np.float32)\n",
        "res_optimized = optimized_model(x)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "i2IKNc2jbax8"
      },
      "source": [
        "We can print the type of the optimized model to see which compiler was faster:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "dFhqAhr0bcbZ",
        "outputId": "aa0b2fe9-2fa0-405b-8e44-3ebbf70f0e69"
      },
      "outputs": [],
      "source": [
        "optimized_model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_UuiqkEfcPy4"
      },
      "source": [
        "In our case, the optimized model type was NumpyONNXInferenceLearner, so this means that onnxruntime was the faster compiler.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "E4759DQJcc15"
      },
      "source": [
        "After the optimization step, we can compare the optimized model with the baseline one in order to verify that the output is the same and to measure the speed improvement"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ktQaNfGqceOD"
      },
      "source": [
        "First of all, let's compute and print the original model result\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "gUMlNAZrcj5-",
        "outputId": "3670f41f-b2db-4b55-dbf7-c9b0a0146c9d"
      },
      "outputs": [],
      "source": [
        "import onnx\n",
        "import onnxruntime as ort\n",
        "from typing import Dict, List\n",
        "\n",
        "\n",
        "def get_input_names(onnx_model: str):\n",
        "    model = onnx.load(onnx_model)\n",
        "    input_all = [node.name for node in model.graph.input]\n",
        "    return input_all\n",
        "\n",
        "\n",
        "def get_output_names(onnx_model: str):\n",
        "    model = onnx.load(onnx_model)\n",
        "    output_all = [node.name for node in model.graph.output]\n",
        "    return output_all\n",
        "\n",
        "\n",
        "def run_onnx_model(\n",
        "    onnx_model: str, session: ort.InferenceSession, input_tensors: List[np.ndarray], inputs: Dict, output_names: str\n",
        ") -> List[np.ndarray]:\n",
        "    \n",
        "    res = session.run(\n",
        "        output_names=output_names, input_feed=inputs\n",
        "    )\n",
        "    return list(res)\n",
        "\n",
        "\n",
        "session = ort.InferenceSession(\n",
        "    model,\n",
        "    providers=[\"CUDAExecutionProvider\", \"CPUExecutionProvider\"] # Change to [\"CPUExecutionProvider\"] if run on cpu\n",
        ")\n",
        "\n",
        "inputs = {\n",
        "    name: array\n",
        "    for name, array in zip(get_input_names(model), [x])\n",
        "}\n",
        "\n",
        "res_original = run_onnx_model(model, session, [x], inputs, get_output_names(model))\n",
        "res_original"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iU3dPwSTfWr_"
      },
      "source": [
        "Then, let's print the optimized model result that we computed before"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "S1EKoJ75fVAh",
        "outputId": "73e7b127-e7d3-44a9-bd78-65961bd051df"
      },
      "outputs": [],
      "source": [
        "res_optimized"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Lj4crPMmf_LX"
      },
      "source": [
        "Then, let's compute the average latency of the baseline model:\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rGNKr_ShgBbu"
      },
      "outputs": [],
      "source": [
        "import time"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "I2G4OzhCgG_D",
        "outputId": "a23eb4ea-fa0f-4221-a177-20876e452b53"
      },
      "outputs": [],
      "source": [
        "num_iters = 100\n",
        "\n",
        "# Warmup\n",
        "for i in range(10):\n",
        "  run_onnx_model(model, session, [x], inputs, get_output_names(model))\n",
        "\n",
        "start = time.time()\n",
        "for i in range(num_iters):\n",
        "  run_onnx_model(model, session, [x], inputs, get_output_names(model))\n",
        "stop = time.time()\n",
        "\n",
        "print(\"Average latency original model: {:.4f} seconds\".format((stop - start) / num_iters))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f-jmRjJvgW5V"
      },
      "source": [
        "Finally we compute the average latency for the optimized model:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "51c3uaMcgaR-",
        "outputId": "1319a7bc-df1d-4f19-9426-3940ab4a7c5e"
      },
      "outputs": [],
      "source": [
        "# Warmup\n",
        "for i in range(10):\n",
        "  optimized_model(x)\n",
        "\n",
        "start = time.time()\n",
        "for i in range(num_iters):\n",
        "  optimized_model(x)\n",
        "stop = time.time()\n",
        "\n",
        "print(\"Average latency optimized model: {:.4f} seconds\".format((stop - start) / num_iters))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tBeRKNTI3iyK"
      },
      "source": [
        "## Scenario 2 - Accuracy drop"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "w3wutIzfAMe_"
      },
      "source": [
        "In this scenario, we set a max threshold for the accuracy drop to 2%"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fO1nGqpj3p7z"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "from speedster import optimize_model\n",
        "\n",
        "# Load a resnet as example\n",
        "model = \"resnet50-v1-12.onnx\"\n",
        "\n",
        "# Provide an input data for the model\n",
        "# Note that in this case we should provide the model at least 100 data samples\n",
        "input_data = [((np.random.randn(1, 3, 224, 224).astype(np.float32), ), np.array([0])) for i in range(100)]\n",
        "\n",
        "# Run nebullvm optimization\n",
        "optimized_model = optimize_model(\n",
        "  model, input_data=input_data, optimization_time=\"unconstrained\", metric = \"accuracy\", metric_drop_ths = 0.02\n",
        ")\n",
        "\n",
        "# Try the optimized model\n",
        "x = np.random.randn(1, 3, 224, 224).astype(np.float32)\n",
        "res_optimized = optimized_model(x)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4UFtwZbEiLv3"
      },
      "source": [
        "Here we compute the average throughput for the baseline model:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "qFKHaHM6-GKm",
        "outputId": "73b95996-4d1f-4aa7-a96d-a40070bf36bd"
      },
      "outputs": [],
      "source": [
        "num_iters = 100\n",
        "\n",
        "# Warmup\n",
        "for i in range(10):\n",
        "  run_onnx_model(model, session, [x], inputs, get_output_names(model))\n",
        "\n",
        "start = time.time()\n",
        "for i in range(num_iters):\n",
        "  run_onnx_model(model, session, [x], inputs, get_output_names(model))\n",
        "stop = time.time()\n",
        "\n",
        "print(\"Average latency original model: {:.4f} seconds\".format((stop - start) / num_iters))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J8g0aJRJiXA5"
      },
      "source": [
        "Here we compute the average throughput for the optimized model:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "_IbAW0KA4Fm5",
        "outputId": "67f44401-9568-4f38-802a-d81e3139af5a"
      },
      "outputs": [],
      "source": [
        "# Warmup\n",
        "for i in range(10):\n",
        "  optimized_model(x)\n",
        "\n",
        "start = time.time()\n",
        "for i in range(num_iters):\n",
        "  optimized_model(x)\n",
        "stop = time.time()\n",
        "\n",
        "print(\"Average latency optimized model: {:.4f} seconds\".format((stop - start) / num_iters))"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "ceb60d8c",
      "metadata": {
        "id": "ceb60d8c"
      },
      "source": [
        "## Save and reload the optimized model"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "d9eda1a0",
      "metadata": {},
      "source": [
        "We can easily save to disk the optimized model with the following line:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "62b6fcbf",
      "metadata": {},
      "outputs": [],
      "source": [
        "save_model(optimized_model, \"model_save_path\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "3c968d51",
      "metadata": {},
      "source": [
        "We can then load again the model:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c1340c49",
      "metadata": {},
      "outputs": [],
      "source": [
        "optimized_model = load_model(\"model_save_path\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "b77ff2ac",
      "metadata": {
        "id": "b77ff2ac"
      },
      "source": [
        "<center> \n",
        "    <a href=\"https://discord.com/invite/RbeQMu886J\" target=\"_blank\" style=\"text-decoration: none;\"> Join the community </a> |\n",
        "    <a href=\"https://nebuly.gitbook.io/nebuly/welcome/questions-and-contributions\" target=\"_blank\" style=\"text-decoration: none;\"> Contribute to the library </a>\n",
        "</center>\n",
        "\n",
        "<center> \n",
        "    <a href=\"https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#key-concepts\" target=\"_blank\" style=\"text-decoration: none;\"> How speedster works </a> •\n",
        "    <a href=\"https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#documentation\" target=\"_blank\" style=\"text-decoration: none;\"> Documentation </a> •\n",
        "    <a href=\"https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#quick-start\" target=\"_blank\" style=\"text-decoration: none;\"> Quick start </a> \n",
        "</center>"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "provenance": []
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "Python 3.8.10 64-bit",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.9 (default, Apr 13 2022, 08:48:06) \n[Clang 13.1.6 (clang-1316.0.21.2.5)]"
    },
    "vscode": {
      "interpreter": {
        "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
